Abstract

The partition function pertaining to finite-temperature decoding of a (typical) randomly 
chosen code is known to have three types of behavior, corresponding to three phases in the 
plane of rate vs. temperature: the ferromagnetic phase, corresponding to correct decoding, the 
paramagnetic phase, of complete disorder, which is dominated by exponentially many incorrect 
codewords, and the glassy phase (or the condensed phase), where the system is frozen at minimum
energy and dominated by subexponentially many incorrect codewords. We show that the
statistical physics associated with the two latter phases is intimately related to random coding
exponents. In particular, the exponent associated with the probability of correct decoding at 
rates above capacity is directly related to the free energy in the glassy phase, and the exponent 
associated with probability of error (the error exponent) at rates below capacity, is strongly 
related to the free energy in the paramagnetic phase. In fact, we derive alternative expressions 
of these exponents in terms of the corresponding free energies, and make an attempt to obtain 
some insights from these expressions. Finally, as a side result, we also compare the phase diagram
associated with a simple finite-temperature universal decoder, for discrete memoryless
channels, to that of the finite-temperature decoder that is aware of the channel statistics. 

Index Terms: random coding, free energy, partition function, random energy model (REM), 
phase transitions, error exponents. 
1 Introduction



In the last few decades it has become apparent that many problems in Information Theory, and 
the channel coding problem in particular, can be mapped onto (and interpreted as) analogous 
problems in the area of statistical physics of disordered systems (such as spin glass models). Such 
analogies are useful because physical insights, as well as statistical mechanical tools and analysis 
techniques (like the replica method), can be harnessed in order to advance the knowledge and the 
understanding with regard to the information-theoretic problem under discussion. A very small, 
and by no means exhaustive, sample of works along this line includes references [1]-[29].

In this paper, we shall also adopt the statistical mechanical viewpoint on channel coding. We 
focus on the classical random code ensemble (RCE) for communicating over a discrete memoryless 
channel (DMC), in the same setting as described in [19, Chap. 6] and [23], which in a nutshell, is 
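as follows: Consider a DMC, $P(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p(y_i|x_i)$, fed by an input $n$-vector that belongs to a
codebook $\mathcal{C} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M\}$, $M = e^{nR}$, with uniform priors, where $R$ is the coding rate in nats
per channel use. The induced posterior, for $\mathbf{x} \in \mathcal{C}$, is then:

$$
\begin{aligned}
P(\mathbf{x}|\mathbf{y}) &= \frac{P(\mathbf{y}|\mathbf{x})}{\sum_{\mathbf{x}' \in \mathcal{C}} P(\mathbf{y}|\mathbf{x}')} \\
&= \frac{e^{-\ln[1/P(\mathbf{y}|\mathbf{x})]}}{\sum_{\mathbf{x}' \in \mathcal{C}} e^{-\ln[1/P(\mathbf{y}|\mathbf{x}')]}}.
\end{aligned}
\tag{1}
$$
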
Here, the second line is written in a form that resembles the Boltzmann distribution of statistical 
physics, according to which the probability of a certain 'state' (or 'configuration') of the system, 
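designated by $\mathbf{x}$, is given by

$$P(\mathbf{x}) = \frac{e^{-\beta \mathcal{E}(\mathbf{x})}}{Z(\beta)}, \tag{2}$$

where $\beta = 1/(kT)$ is the inverse temperature, $k$ is Boltzmann's constant,$^1$ $T$ is temperature, $\mathcal{E}(\mathbf{x})$
is the energy associated with $\mathbf{x}$, and $Z(\beta) = \sum_{\mathbf{x}} e^{-\beta \mathcal{E}(\mathbf{x})}$ is the partition function. In our case,
of course, $\beta = 1$ and the energy function (which depends on the given $\mathbf{y}$) is $\mathcal{E}(\mathbf{x}) = \ln[1/P(\mathbf{y}|\mathbf{x})]$.
But this analogy with the Boltzmann distribution (2) naturally suggests (cf., e.g., [19]) to consider,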



1 Here we will adopt the convention, customarily used in many papers and books, of redefining 'temperature'
according to $T \leftarrow kT$, that is, in units of energy, and then $\beta = 1/T$.






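more generally, the posterior distribution parametrized by $\beta$, that is

$$
\begin{aligned}
P_\beta(\mathbf{x}|\mathbf{y}) &= \frac{P^\beta(\mathbf{y}|\mathbf{x})}{\sum_{\mathbf{x}' \in \mathcal{C}} P^\beta(\mathbf{y}|\mathbf{x}')} \\
&= \frac{e^{-\beta\ln[1/P(\mathbf{y}|\mathbf{x})]}}{\sum_{\mathbf{x}' \in \mathcal{C}} e^{-\beta\ln[1/P(\mathbf{y}|\mathbf{x}')]}} \\
&\triangleq \frac{e^{-\beta\ln[1/P(\mathbf{y}|\mathbf{x})]}}{Z(\beta|\mathbf{y})}.
\end{aligned}
\tag{3}
$$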



There are a few motivations for introducing the temperature parameter in (3). First, it allows
a degree of freedom in case there is some uncertainty regarding the channel noise level (small $\beta$
corresponds to high noise level). Second, it is inspired by the ideas behind simulated annealing
techniques: by sampling from $P_\beta$ while gradually increasing $\beta$ (cooling the system), the minima of
the energy function (ground states) can be found. Third, by applying symbolwise MAP decoding,
i.e., decoding the $\ell$-th symbol of $\mathbf{x}$ as $\arg\max_a P_\beta(x_\ell = a|\mathbf{y})$, where

$$P_\beta(x_\ell = a|\mathbf{y}) = \sum_{\mathbf{x} \in \mathcal{C}:\ x_\ell = a} P_\beta(\mathbf{x}|\mathbf{y}),$$

we obtain a family of finite-temperature decoders (originally proposed by Rujan [26]; see also [3],
[19, Section 6.3.3], [29], [27]) parametrized by $\beta$, where $\beta = 1$ corresponds to minimum symbol error
probability (with respect to the true channel) and $\beta \to \infty$ corresponds to minimum block error
probability. Finally, and this is the motivation that drives the research reported in this paper: the
corresponding partition function, $Z(\beta|\mathbf{y})$, namely, the sum of (conditional) probabilities raised to
some power $\beta$, is an expression frequently encountered in Rényi information measures as well as in
the analysis of random coding exponents using Gallager's techniques. Since the partition function 
plays a key role in statistical mechanics, as many physical quantities can be derived from it, then it 
is natural to ask if it can also be used to gain some insights regarding the behavior of random codes 
at various temperatures and coding rates. The main contribution of this paper is in exploring this 
direction. 
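
As an illustration of the finite-temperature posterior (3) and of the symbolwise decoder it induces, the following is a minimal sketch for a small binary codebook used over a BSC. The function names, the restriction to the BSC, and the tie-breaking rule are assumptions made for this example only; they are not taken from the cited works.

```python
import numpy as np

def finite_temperature_posterior(codebook, y, p, beta):
    """P_beta(x|y) over a binary codebook for a BSC with crossover p (illustrative)."""
    codebook = np.asarray(codebook)
    y = np.asarray(y)
    n = codebook.shape[1]
    d_h = np.sum(codebook != y, axis=1)              # Hamming distances to y
    # Energy of each codeword: ln[1/P(y|x)]
    energy = -(d_h * np.log(p) + (n - d_h) * np.log(1.0 - p))
    log_w = -beta * energy
    log_w -= log_w.max()                             # stabilize the exponentials
    w = np.exp(log_w)
    return w / w.sum()                               # division by Z(beta|y)

def symbolwise_decode(codebook, y, p, beta):
    """Decode each symbol as argmax_a P_beta(x_l = a | y); ties go to 0 (assumption)."""
    codebook = np.asarray(codebook)
    post = finite_temperature_posterior(codebook, y, p, beta)
    prob_one = post @ codebook                       # P_beta(x_l = 1 | y) for every l
    return (prob_one > 0.5).astype(int)

codebook = [[0, 0, 0], [1, 1, 1]]
y = [0, 0, 1]
print(finite_temperature_posterior(codebook, y, p=0.1, beta=1.0))
print(symbolwise_decode(codebook, y, p=0.1, beta=1.0))
```

At $\beta = 1$ this is the ordinary posterior; increasing $\beta$ concentrates the mass on the most likely codeword, and $\beta = 0$ gives a uniform distribution over the codebook.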

To sharpen the last point a little further, it is noted that when one considers the random 
coding regime, as we do in this paper, then even if y is given, the energy levels pertaining to the 
Boltzmann distribution (3) are themselves random variables since they depend on the randomly
chosen codevectors. As explained in [19], this then falls under the umbrella of the so-called random
energy model (REM) in statistical physics, which was invented by Derrida [30] with the motivation
to capture disorder in spin glass systems. The interesting fact about the REM is that it is typically
subjected to phase transitions, and then so is the model (3) for random codes.

More specifically, as described in [19, Chap. 6], [25], and as will be briefly reviewed in the next
section, the partition function pertaining to finite-temperature decoding of a (typical) randomly
chosen code is known to have three types of behavior, corresponding to three phases in the plane
of rate vs. temperature: the ferromagnetic phase, corresponding to correct decoding, the paramagnetic
phase, of complete disorder, which is dominated by exponentially many incorrect codewords,
and the glassy phase (or the condensed phase), where the system is frozen at minimum energy and
dominated by subexponentially many incorrect codewords. We show that the statistical physics
associated with the two latter phases is intimately related to random coding exponents. In particular,
the exponent associated with the probability of correct decoding at rates above capacity is
directly related to the free energy in the glassy phase, and the exponent associated with probability 
of error (the error exponent) at rates below capacity, is strongly related to the free energy in the 
paramagnetic phase. In fact, we derive alternative expressions of these exponents in terms of the 
corresponding free energies, and make an attempt to obtain some insights from these expressions. 

An additional interesting byproduct of the statistical mechanical point of view that we adopt in 
this work, is that it suggests a more refined analysis technique, as an alternative to the customary 
use of Jensen's inequality, for which it is clear that the resulting expressions are exponentially tight, 
and not just bounds. Another way to look at this is to observe that the analysis technique, inspired
by the statistical mechanical point of view, provides us with insights with regard to the conditions under
which Jensen's inequality provides a tight bound in this context. We believe that this technique
may be useful in other applications as well. We shall elaborate more on this in the sequel. 

As a side result, we also compare the phase diagram associated with a certain universal decoder 
(namely, the minimum conditional entropy universal decoder) for discrete memoryless channels, to 
that of the finite-temperature decoder that is aware of the channel statistics, and show that in 
spite of the fact that this universal decoder is asymptotically optimum, in the sense of attaining 
optimum random coding error exponents [31], its phase diagram is substantially different. 

The outline of the remaining part of this paper is as follows. In Section 2, we provide some
background, which mostly follows the presentation in [19] (with a few missing details filled in), but
will be useful here to keep this paper self-contained. Section 2 also includes a subsection with the
phase diagram for universal decoding, as described in the previous paragraph. In Section 3, we
derive the alternative formula for the exponent of correct decoding above capacity, and in Section
4, we do the same regarding the random coding exponent at rates below capacity.

2 Notation Conventions, Background and Preliminaries 

2.1 Notation Conventions 

Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, like X 
and Y, their sample values will be denoted by the respective lower case letters, and their alphabets 
will be denoted by the respective calligraphic letters. A similar convention will apply to random 
vectors and their sample values, which will be denoted with the same symbols in the boldface font. 
Thus, for example, $\mathbf{X}$ will denote a random $n$-vector $(X_1, \ldots, X_n)$, and $\mathbf{x} = (x_1, \ldots, x_n)$ is a specific
vector value in $\mathcal{X}^n$, the $n$-th Cartesian power of $\mathcal{X}$.

Sources and channels will be denoted generically by the letters $P$ and $Q$. Specific letter probabilities
corresponding to a source $Q$ will be denoted by the corresponding lower case letters, e.g.,
$q(x)$ is the probability of a letter $x \in \mathcal{X}$. A similar convention will be applied to the channel $P$ and
the corresponding transition probabilities, $p(y|x)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$. The expectation operator will be
denoted by $E\{\cdot\}$.

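The empirical distribution pertaining to a vector $\mathbf{x} \in \mathcal{X}^n$ will be denoted by $\hat{P}_{\mathbf{x}}$. In other
words, $\hat{P}_{\mathbf{x}} = \{\hat{p}_{\mathbf{x}}(a),\ a \in \mathcal{X}\}$, where $\hat{p}_{\mathbf{x}}(a) = n_{\mathbf{x}}(a)/n$, $n_{\mathbf{x}}(a)$ being the number of occurrences of
the letter $a$ in $\mathbf{x}$. Similar conventions will apply to empirical joint distributions of pairs of letters,
$(a, b) \in \mathcal{X} \times \mathcal{Y}$, extracted from the corresponding pairs of vectors $(\mathbf{x}, \mathbf{y})$, that is, the joint empirical
distribution $\hat{P}_{\mathbf{x}\mathbf{y}}$ is the vector of relative frequencies of joint occurrences of $x_i = a$ and $y_i = b$, $i =
1, \ldots, n$. Similarly, $\hat{p}_{\mathbf{x}|\mathbf{y}}(a|b) = \hat{p}_{\mathbf{x}\mathbf{y}}(a, b)/\hat{p}_{\mathbf{y}}(b)$ will denote the empirical conditional probability of
$X = a$ given $Y = b$ (with the convention that $0/0 = 0$), and $\hat{P}_{\mathbf{x}|\mathbf{y}}$ will denote $\{\hat{p}_{\mathbf{x}|\mathbf{y}}(a|b),\ a \in \mathcal{X},\ b \in \mathcal{Y}\}$.
The expectation w.r.t. the empirical distribution of $(\mathbf{x}, \mathbf{y})$ will be denoted by $\hat{E}_{\mathbf{x}\mathbf{y}}\{\cdot\}$, i.e., for a
given function $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, we define $\hat{E}_{\mathbf{x}\mathbf{y}}\{f(X,Y)\}$ as $\sum_{(a,b) \in \mathcal{X} \times \mathcal{Y}} \hat{p}_{\mathbf{x}\mathbf{y}}(a, b) f(a, b)$, where
in this notation, $X$ and $Y$ are understood to be random variables jointly distributed according to
$\hat{P}_{\mathbf{x}\mathbf{y}}$.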
The cardinality of a finite set $\mathcal{A}$ will be denoted by $|\mathcal{A}|$. For two positive sequences $\{a_n\}$ and
$\{b_n\}$, the notation $a_n \doteq b_n$ means that $a_n$ and $b_n$ are asymptotically of the same exponential
order, that is, $\lim_{n\to\infty} \frac{1}{n}\ln\frac{a_n}{b_n} = 0$. Information theoretic quantities like entropies and mutual
informations will be denoted following the usual conventions of the Information Theory literature.
When we wish to make it clear that such an information theoretic quantity is induced by a certain
probability distribution, say $Q$, we use this probability distribution as a subscript, e.g., $I_Q(X;Y)$,
$H_Q(X|Y)$, etc. When the underlying probability distribution is an empirical distribution, we will
subscript it by the sequence(s) from which the empirical distribution is extracted, and we will use
hats, e.g., $\hat{I}_{\mathbf{x}\mathbf{y}}(X;Y)$, $\hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)$.

2.2 Background and Preliminaries 

Consider a DMC with a finite input alphabet $\mathcal{X}$ and a finite output alphabet $\mathcal{Y}$, which, when fed
by an input vector $\mathbf{x} \in \mathcal{X}^n$, generates an output vector $\mathbf{y} \in \mathcal{Y}^n$ distributed according to

$$P(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p(y_i|x_i),$$

where $\{p(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$ are given single-letter transition probabilities. Let

$$\mathcal{C} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M\} \subset \mathcal{X}^n$$

be a codebook of $M = e^{nR}$ codewords, where $R$ is the coding rate (in nats per channel use). Next
consider the posterior distribution (3) and the corresponding partition function

$$Z(\beta|\mathbf{y}) = \sum_{\mathbf{x} \in \mathcal{C}} P^\beta(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{x} \in \mathcal{C}} e^{-\beta d(\mathbf{x},\mathbf{y})}, \tag{4}$$

where $d(\mathbf{x},\mathbf{y}) = -\ln P(\mathbf{y}|\mathbf{x}) = -\sum_{i=1}^{n} \ln p(y_i|x_i)$. We shall think of $Z(\beta|\mathbf{y})$ as the sum of two
contributions, the first is $Z_c(\beta|\mathbf{y}) = e^{-\beta d(\mathbf{x}_0,\mathbf{y})}$, pertaining to the correct codeword $\mathbf{x}_0$ (that was
actually transmitted across the channel), and the second is associated with the remaining (incorrect)
codewords,

$$Z_e(\beta|\mathbf{y}) = \sum_{\mathbf{x} \in \mathcal{C} \setminus \{\mathbf{x}_0\}} e^{-\beta d(\mathbf{x},\mathbf{y})}.$$

Let us focus on $Z_e(\beta|\mathbf{y})$ first. As mentioned in the Introduction, when the codebook $\mathcal{C}$ is selected at
random, this is a disordered system in the framework of the REM, which exhibits phase transitions.
To describe these phase transitions, it is instructive to begin with the relatively simple special
case of the binary symmetric channel (BSC), as we do in Subsection 2.2.1, and then extend the
scope to general DMC's, as in Subsection 2.2.2.$^2$ Finally, Subsection 2.2.3 (which is not included
in [19]) is about a phase diagram pertaining to universal decoding (cf. second to the last paragraph
of the Introduction). This subsection can be skipped without loss of continuity.

2.2.1 The Binary Symmetric Channel 

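For the BSC with a crossover parameter $p$, we have $P(\mathbf{y}|\mathbf{x}) = p^{d_H(\mathbf{x},\mathbf{y})}(1-p)^{n-d_H(\mathbf{x},\mathbf{y})}$, where
$d_H(\mathbf{x},\mathbf{y})$ is the Hamming distance between $\mathbf{x}$ and $\mathbf{y}$. Defining $B = \ln\frac{1-p}{p}$, we then have $P(\mathbf{y}|\mathbf{x}) =
(1-p)^n e^{-B d_H(\mathbf{x},\mathbf{y})}$, and so

$$
\begin{aligned}
Z_e(\beta|\mathbf{y}) &= \sum_{\mathbf{x} \in \mathcal{C} \setminus \{\mathbf{x}_0\}} P^\beta(\mathbf{y}|\mathbf{x}) \\
&= (1-p)^{\beta n} \sum_{\mathbf{x} \in \mathcal{C} \setminus \{\mathbf{x}_0\}} e^{-\beta B d_H(\mathbf{x},\mathbf{y})} \\
&= (1-p)^{\beta n} \sum_{d=0}^{n} N_{\mathbf{y}}(d)\, e^{-\beta B d},
\end{aligned}
\tag{5}
$$

where $N_{\mathbf{y}}(d)$ is the number of incorrect codewords at Hamming distance $d$ from $\mathbf{y}$. As argued in
[19], when the codewords are chosen independently at random (say, by fair coin tossing), $\{N_{\mathbf{y}}(d)\}$
concentrate very rapidly,$^3$ as $n \to \infty$, about their expectations:

$$E\{N_{\mathbf{y}}(\delta n)\} \doteq e^{n[R - \ln 2 + h(\delta)]}, \quad 0 \le \delta \le 1, \tag{6}$$

where $h(\delta) = -\delta\ln\delta - (1-\delta)\ln(1-\delta)$. Defining the normalized Gilbert-Varshamov (GV) distance,
$\delta_{GV}(R)$, as the solution, $\delta$, to the equation $h(\delta) = \ln 2 - R$, it is apparent that for $\delta < \delta_{GV}(R)$ and
$\delta > 1 - \delta_{GV}(R)$, $E\{N_{\mathbf{y}}(\delta n)\}$ has a negative exponent, and thus typically, these distances are not
populated by codewords. Therefore, for a typical random code,

$$
\begin{aligned}
Z_e(\beta|\mathbf{y}) &\doteq (1-p)^{\beta n} \cdot e^{n(R - \ln 2)} \int_{\delta_{GV}(R)}^{1-\delta_{GV}(R)} d\delta \cdot e^{n h(\delta)} \cdot e^{-\beta B n \delta} \\
&\doteq (1-p)^{\beta n} \cdot e^{n(R - \ln 2)} \cdot \exp\left\{ n \max_{\delta \in [\delta_{GV}(R),\, 1-\delta_{GV}(R)]} [h(\delta) - \beta B \delta] \right\} \\
&\triangleq e^{-n\beta F_e(\beta)},
\end{aligned}
\tag{7}
$$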



2 This extension to general DMC's is outlined in [19, Chap. 6], but here we provide some more details.

3 Note that $N_{\mathbf{y}}(d) = \sum_{i} 1\{d_H(\mathbf{x}_i, \mathbf{y}) = d\}$, i.e., it is the sum of exponentially many i.i.d. (given $\mathbf{y}$) random
variables, and so, its large deviations behavior is exponential in $e^{nR}$, which is double-exponential in $n$ (see also
Appendix, Subsection A.2).
where $F_e(\beta)$ is the free energy density associated with the incorrect codewords, which is given by

$$
F_e(\beta) = \begin{cases}
\delta_{GV}(R)\ln\frac{1}{p} + (1-\delta_{GV}(R))\ln\frac{1}{1-p}, & p_\beta < \delta_{GV}(R) \\
\frac{1}{\beta}\left[\ln 2 - R - \ln\left(p^\beta + (1-p)^\beta\right)\right], & p_\beta \ge \delta_{GV}(R)
\end{cases}
\tag{8}
$$

where

$$p_\beta \triangleq \frac{p^\beta}{p^\beta + (1-p)^\beta},$$

and where the distinction between the two different expressions is due to the constraint $\delta \in
[\delta_{GV}(R), 1-\delta_{GV}(R)]$, which becomes active (i.e., achieved with equality) when $p_\beta < \delta_{GV}(R)$.
We observe then that when $p$ and $R$ are held fixed, and $\beta$ varies, the above expression exhibits a
phase transition at temperature $T_c(R) = 1/\beta_c(R)$ for which $p_\beta = \delta_{GV}(R)$, i.e.,

$$\beta_c(R) = \frac{\ln[(1-\delta_{GV}(R))/\delta_{GV}(R)]}{\ln[(1-p)/p]}.$$

For $\beta > \beta_c$ (low temperature), the free energy density $F_e(\beta) = \delta_{GV}(R)\ln\frac{1}{p} + (1-\delta_{GV}(R))\ln\frac{1}{1-p}$ is
independent of $\beta$, hence the entropy (which is related to the derivative of $F_e(\beta)$ w.r.t. $\beta$) vanishes,
and the system is frozen in the sense that the thermodynamics are dominated by a subexponential
number of configurations of the minimum energy, which is $n\delta_{GV}(R)$. This phase is referred to as the
condensed phase or glassy phase, and henceforth we denote

$$F_g = \delta_{GV}(R)\ln\frac{1}{p} + (1-\delta_{GV}(R))\ln\frac{1}{1-p}.$$

For $\beta < \beta_c$ (high temperature), the thermodynamics are dominated by an exponential number of states at distance $np_\beta$,
which is larger than $n\delta_{GV}(R)$, and the entropy is strictly positive. This is called the paramagnetic
phase and henceforth we denote

$$F_p(\beta) \triangleq \frac{1}{\beta}\left[\ln 2 - R - \ln\left(p^\beta + (1-p)^\beta\right)\right].$$

When the contribution of $Z_c(\beta) = e^{-n\beta F_c}$ is taken into account, and we consider the total
partition function $Z(\beta)$, the situation changes: Since $d_H(\mathbf{x}_0, \mathbf{y})$ is typically about the level of $np$,
and thus the corresponding free energy density is $F_c = h(p)$, we have yet another phase referred to
as the ordered phase or the ferromagnetic phase. This phase exists whenever $Z(\beta)$ is dominated by
$Z_c(\beta)$, i.e., $F_c = h(p) < F_e(\beta)$. For $\beta > \beta_c$, this is the case whenever $p < \delta_{GV}(R)$, or equivalently,
$R < \ln 2 - h(p) = C$, where $C$ is the capacity of the BSC. For $\beta < \beta_c$, the boundary between the
ferromagnetic phase and the paramagnetic phase is given by the solution $\beta_0(R) = 1/T_0(R)$ to the
equation

$$\beta h(p) = \ln 2 - R - \ln\left[p^\beta + (1-p)^\beta\right]. \tag{9}$$

To summarize, while there are only two phases (glassy and paramagnetic) pertaining to $Z_e(\beta)$,
there is a third, additional phase (ferromagnetic) associated with $Z_c(\beta)$. In the ferromagnetic phase,
the system is dominated by one state corresponding to the correct codeword. Thus, similarly as
in the glassy phase, the entropy of the ferromagnetic phase is zero. The boundaries between the
three phases in the plane defined by $R$ and $T = 1/\beta$ are as follows (see Fig. 1): The ferro-glassy
boundary is the straight line $R = C$, the glassy-paramagnetic boundary is the curve $T = T_c(R)$,
and the ferro-paramagnetic boundary $T = T_0(R)$ is given by eq. (9). The triple point
where all boundaries intersect is the point $(R, T) = (C, 1)$.
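
The three-phase picture for the BSC can be explored numerically. The sketch below evaluates $\delta_{GV}(R)$ by bisection, then $\beta_c(R)$ and $F_e(\beta)$ from the expressions above, and classifies a point $(R, \beta)$ by comparing $F_c = h(p)$ with $F_e(\beta)$. The function names and the bisection tolerance are choices made for this illustration, and the code assumes $0 < R < \ln 2$ and $0 < p < 1/2$.

```python
import math

def h(d):
    """Binary entropy in nats."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)

def delta_gv(R, tol=1e-12):
    """Root of h(delta) = ln 2 - R in [0, 1/2], found by bisection."""
    target = math.log(2) - R
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def beta_c(R, p):
    """Glassy-paramagnetic critical inverse temperature."""
    d = delta_gv(R)
    return math.log((1.0 - d) / d) / math.log((1.0 - p) / p)

def free_energy_incorrect(beta, R, p):
    """F_e(beta): glassy branch if p_beta < delta_GV(R), paramagnetic otherwise."""
    s = p ** beta + (1.0 - p) ** beta
    p_beta = p ** beta / s
    d = delta_gv(R)
    if p_beta < d:
        return d * math.log(1.0 / p) + (1.0 - d) * math.log(1.0 / (1.0 - p))
    return (math.log(2) - R - math.log(s)) / beta

def phase(beta, R, p):
    """Classify (R, beta) by comparing F_c = h(p) with F_e(beta)."""
    if h(p) < free_energy_incorrect(beta, R, p):
        return "ferromagnetic"
    return "glassy" if beta > beta_c(R, p) else "paramagnetic"

p = 0.1
C = math.log(2) - h(p)
print(beta_c(C, p))            # close to 1: the triple point (C, 1)
print(phase(1.0, 0.2, p), phase(2.0, 0.5, p), phase(0.5, 0.5, p))
```

Sweeping `phase` over a grid of $(R, 1/\beta)$ values gives a numerical counterpart of Fig. 1.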

Figure 1: Phase diagram of the finite-temperature MAP decoder (horizontal axis: $R$; vertical axis: $T = 1/\beta$).

In spite of the fact that in the glassy phase there are only a few configurations that dominate
the behavior, it is no different from the paramagnetic phase in terms of the typical ranking of the
likelihood of the correct codeword among all codewords: In both phases, the typical location of
the correct codeword in the list of descending likelihoods, $\{P(\mathbf{y}|\mathbf{x}_i)\}$, is about $2^{n(R-C)}$ ($R > C$).
Although the glassy phase exhibits less uncertainty, or equivalently, more certainty, (sublinear 
conditional entropy given y about the channel input), this relative certainty is misleading because 
the posterior probability mass is captured mostly by incorrect codewords. In this sense, the glassy 
phase is even more problematic than the paramagnetic one: Since the certainty is fictitious, it is 
more difficult to detect errors. 

2.2.2 Extension to General DMC's 

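The extension to general DMC's is essentially quite straightforward. Consider a DMC parametrized
by $\{p(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$. For the sake of simplicity, let us consider the uniform random
coding distribution,$^4$ according to which each codeword is selected independently at random with
probability distribution $Q(\mathbf{x}) = 1/|\mathcal{X}|^n$ for all $\mathbf{x} \in \mathcal{X}^n$. For a given channel output vector $\mathbf{y}$,
the probability of selecting a random codeword $\mathbf{x}$ whose conditional empirical distribution with
$\mathbf{y}$ is $\hat{P}_{\mathbf{x}|\mathbf{y}}$ is of the exponential order of $e^{-n[\ln|\mathcal{X}| - \hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)]}$, and thus the expected number of
codewords with this conditional distribution is exponentially

$$E\{N_{\mathbf{y}}(\hat{P}_{\mathbf{x}|\mathbf{y}})\} \doteq e^{n[R - \ln|\mathcal{X}| + \hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)]}.$$

In analogy to the explanation provided in the previous subsection (and in [19]), in the context of
the BSC, those conditional distributions $\{\hat{P}_{\mathbf{x}|\mathbf{y}}\}$ for which the exponent on the right-hand side is
negative are typically not populated. Thus, for a typical random code

$$
\begin{aligned}
Z_e(\beta|\mathbf{y}) &= \sum_{\mathbf{x} \in \mathcal{C} \setminus \{\mathbf{x}_0\}} P^\beta(\mathbf{y}|\mathbf{x}) \\
&= \sum_{\mathbf{x} \in \mathcal{C} \setminus \{\mathbf{x}_0\}} e^{-\beta\ln[1/P(\mathbf{y}|\mathbf{x})]} \\
&= \sum_{\{\hat{P}_{\mathbf{x}|\mathbf{y}}\}} N_{\mathbf{y}}(\hat{P}_{\mathbf{x}|\mathbf{y}}) \cdot \exp\left\{-n\beta \hat{E}_{\mathbf{x}\mathbf{y}} \ln[1/P(Y|X)]\right\} \\
&\doteq \exp\left\{ n\left( R - \ln|\mathcal{X}| + \max_{Q_{X|Y}:\ H_Q(X|Y) \ge \ln|\mathcal{X}| - R} \left[H_Q(X|Y) - \beta E_Q\{\ln[1/P(Y|X)]\}\right] \right) \right\} \\
&\triangleq e^{-n\beta F_e(\beta, Y)},
\end{aligned}
\tag{10}
$$

where $Y$ designates a RV distributed according to the empirical distribution $\hat{P}_{\mathbf{y}}$ of $\mathbf{y}$.

4 Other random coding distributions can be used as well, but will lead to somewhat more complicated expressions,
which we prefer to avoid in this description.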

A word on notation is now in order: here and throughout the sequel, we adopt the common
abuse of notation, customarily used in the Information Theory literature, that when a RV appears
as an argument or a subscript of a certain function, this means that it is actually a functional of
its distribution, not a function of the value of the random variable itself. Whenever we wish to
emphasize the dependence of this quantity on the empirical distribution $\hat{P}_{\mathbf{y}}$, we will replace $Y$ by
$\hat{P}_{\mathbf{y}}$ or simply by $\mathbf{y}$ itself, provided that the context does not leave room for ambiguity. Similar
comments will apply to other quantities to be defined throughout this subsection and in the sequel.$^5$
For some of these quantities, we will not denote the dependence on the distribution of $Y$ explicitly,
in order to avoid cumbersome notation, but it will be made clear that they do depend on it in
general.

5 In Subsection 2.2.1, this issue did not arise since all relevant quantities happened to be independent of $\hat{P}_{\mathbf{y}}$, due
to the symmetry of the BSC.

Consider now the expression

$$J_Y(\beta, R) = \max_{Q_{X|Y}:\ H_Q(X|Y) \ge \ln|\mathcal{X}| - R} \left[H_Q(X|Y) - \beta E_Q\{d(X,Y)\}\right],$$

where $d(x,y) = -\ln p(y|x)$.

First, it is easy to prove (see Appendix, Subsection A.1) that for fixed $\beta$ and $\mathbf{y}$, the function
$J_Y(\beta, R)$ is concave in $R$. This means that the inequality constraint $H_Q(X|Y) \ge \ln|\mathcal{X}| - R$ is met
with equality as long as $R < R_Y(\beta)$, where $R_Y(\beta) = \ln|\mathcal{X}| - H_{Q_\beta}(X|Y)$ with $Q_\beta$ being the achiever
of

$$J_Y(\beta, \ln|\mathcal{X}|) = \max_{Q_{X|Y}} \left[H_Q(X|Y) - \beta E_Q\{d(X,Y)\}\right],$$

that is,

$$Q_\beta(x|y) = \frac{e^{-\beta d(x,y)}}{\sum_{x' \in \mathcal{X}} e^{-\beta d(x',y)}} = \frac{p^\beta(y|x)}{\sum_{x' \in \mathcal{X}} p^\beta(y|x')}.$$

We will also use the notation

$$D_Y(\beta) = E_{Q_\beta}\{d(X,Y)\}$$

and

$$H_Y(\beta) = H_{Q_\beta}(X|Y),$$

thus $R_Y(\beta) = \ln|\mathcal{X}| - H_Y(\beta)$. Let

$$\beta_c(R) = \inf\{\beta:\ R_Y(\beta) \ge R\} = \inf\{\beta:\ H_Y(\beta) \le \ln|\mathcal{X}| - R\}.$$

Obviously, $\beta_c(R)$ increases with $R$, or equivalently, $T_c(R) = 1/\beta_c(R)$ is decreasing with $R$ ($T_c(\ln|\mathcal{X}|) =
0$). This forms the boundary curve between the glassy and the paramagnetic phases. Note that
when $R = I(X;Y)$, the mutual information induced by the uniform distribution on $\mathcal{X}$ and by
$p(y|x)$, then $\beta_c(R) = 1$. Thus, $(I(X;Y), 1)$ is a point on the curve $T = T_c(R)$.
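
The quantities $Q_\beta$, $H_Y(\beta)$, $D_Y(\beta)$ and $\beta_c(R)$ are straightforward to evaluate numerically for a small channel. The following is a minimal sketch, assuming strictly positive transition probabilities and a user-supplied output distribution `p_y`; the function names, the bisection bracket `beta_hi` and the iteration count are illustrative choices, and the bisection relies on $H_Y(\beta)$ being nonincreasing in $\beta \ge 0$.

```python
import numpy as np

def tilted_stats(P, p_y, beta):
    """Return (H_Y(beta), D_Y(beta)) for channel matrix P[x, y] = p(y|x) > 0."""
    P = np.asarray(P, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    d = -np.log(P)                                   # d(x, y) = -ln p(y|x)
    log_w = -beta * d
    m = log_w.max(axis=0, keepdims=True)
    log_norm = m + np.log(np.exp(log_w - m).sum(axis=0, keepdims=True))
    log_q = log_w - log_norm                         # ln Q_beta(x|y), column-normalized
    q = np.exp(log_q)
    H = -np.sum(p_y * np.sum(q * log_q, axis=0))     # H_{Q_beta}(X|Y)
    D = np.sum(p_y * np.sum(q * d, axis=0))          # E_{Q_beta}{d(X,Y)}
    return float(H), float(D)

def beta_c(P, p_y, R, beta_hi=50.0, iters=200):
    """inf{beta : H_Y(beta) <= ln|X| - R}, by bisection on [0, beta_hi]."""
    P = np.asarray(P, dtype=float)
    target = np.log(P.shape[0]) - R
    lo, hi = 0.0, beta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_stats(P, p_y, mid)[0] <= target:
            hi = mid
        else:
            lo = mid
    return hi

# BSC with crossover 0.1, uniform input, hence uniform output
P = np.array([[0.9, 0.1], [0.1, 0.9]])
p_y = np.array([0.5, 0.5])
H1, D1 = tilted_stats(P, p_y, 1.0)
I = np.log(2) - H1                                   # mutual information I(X;Y)
print(H1, D1, beta_c(P, p_y, I))
```

For this symmetric example, the printed critical value is close to 1 when $R = I(X;Y)$, in agreement with the remark that $(I(X;Y), 1)$ lies on the curve $T = T_c(R)$.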

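For $R < R_Y(\beta)$, or equivalently, $\beta > \beta_c(R)$, the constraint $H_Q(X|Y) \ge \ln|\mathcal{X}| - R$ is attained
with equality. Thus, in this range of low rates,

$$
\begin{aligned}
J_Y(\beta, R) &= \max_{\{Q_{X|Y}:\ H_Q(X|Y) = \ln|\mathcal{X}| - R\}} \left[\ln|\mathcal{X}| - R - \beta E_Q\{d(X,Y)\}\right] \\
&= \ln|\mathcal{X}| - R - \beta \cdot \min_{\{Q_{X|Y}:\ H_Q(X|Y) = \ln|\mathcal{X}| - R\}} E_Q\{d(X,Y)\} \\
&= \ln|\mathcal{X}| - R - \beta D_Y(\beta_R),
\end{aligned}
\tag{11}
$$

where $\beta_R$ is the solution to the equation $H_Y(\beta) = \ln|\mathcal{X}| - R$. We will also use the notation
$\delta_Y(R) = D_Y(\beta_R)$.$^6$ It follows then that $F_e(\beta, Y) = F_g(Y) = \delta_Y(R)$, which is the glassy phase.

For $R > R_Y(\beta)$,

$$J_Y(\beta, R) = J_Y(\beta, \ln|\mathcal{X}|) = \max_{Q_{X|Y}} \left[H_Q(X|Y) - \beta E_Q\{d(X,Y)\}\right] = H_Y(\beta) - \beta D_Y(\beta).$$

Thus, for $\beta < \beta_c(R)$,

$$F_e(\beta, Y) = F_p(\beta, Y) = D_Y(\beta) + \frac{\ln|\mathcal{X}| - R - H_Y(\beta)}{\beta},$$
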

which is the paramagnetic phase. It should be pointed out that for a general decoding metric 
d(x,y) (not necessarily ML matched to the channel), the boundary between the paramagnetic and 
the glassy phases depends only on the random coding distribution and this decoding metric d(x, y), 
not on the channel itself (cf. Subsection 2.2.3). The boundaries with the ferromagnetic phase are 
the ones that depend on the channel. 

In the ordered (ferromagnetic) phase, the free energy density is given by $F(\beta) = H(Y|X)$, where
$X$ is uniform and $Y$ given $X$ is distributed according to the channel. As long as $R < I(X;Y)$,
we have $H(Y|X) < \delta_Y(R)$. In fact, the line connecting the points $(R = I(X;Y), T = 1)$ and
$(R = I(X;Y), T = 0)$ forms the boundary between the ordered ferromagnetic phase and the glassy
phase.

6 The quantity $\delta_Y(R)$ is the generalization of the GV distance that was defined in Subsection 2.2.1 for the BSC.

For $R < I(X;Y)$, the boundary between the ferromagnetic and paramagnetic phases is given
by the solution $\beta_0(R)$ (or $T_0(R) = 1/\beta_0(R)$) to the equation

$$\beta H(Y|X) = \beta D_Y(\beta) + \ln|\mathcal{X}| - R - H_Y(\beta),$$

which is above the curve $T = T_c(R)$ for $R < I(X;Y)$. It should be emphasized that $\beta_c(R)$, $\beta_0(R)$,
and $\beta_R$ all depend on the (distribution of the) RV $Y$, namely, the empirical distribution of $\mathbf{y}$.

2.2.3 Phase Diagram for Universal Decoding 

It is instructive to compare the phase diagram of finite-temperature MAP decoding to those of
finite-temperature universal decoders. One simple example of a universal decoder for which it
is especially easy to derive the phase diagram is the minimum conditional entropy decoder [31],
which, given $\mathbf{y}$, selects the codeword $\mathbf{x}_m$ for which $\hat{H}_{\mathbf{x}_m\mathbf{y}}(X|Y)$ is minimum.$^7$ It is well known that
this universal decoder is asymptotically optimum in the random coding sense, in that it achieves
the same random coding error exponent as the ML decoder, provided that the random coding
distribution is uniform over $\mathcal{X}^n$.
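
To make the decoding rule concrete, the following minimal sketch computes the empirical conditional entropy $\hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)$ from a pair of sequences and selects the codeword that minimizes it. Symbols are assumed to be integers in $\{0, \ldots, |\mathcal{X}|-1\}$ and $\{0, \ldots, |\mathcal{Y}|-1\}$; the function names and the tie-breaking rule (lowest index wins) are assumptions of this example.

```python
import numpy as np

def empirical_conditional_entropy(x, y, nx, ny):
    """Empirical H(X|Y) in nats from two equal-length integer sequences."""
    x = np.asarray(x)
    y = np.asarray(y)
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x, y), 1.0)                    # joint type counts
    joint /= len(x)                                  # empirical joint distribution
    p_y = joint.sum(axis=0)                          # empirical marginal of y
    mask = joint > 0                                 # skip zero-probability pairs
    cond = joint[mask] / np.broadcast_to(p_y, joint.shape)[mask]
    return float(-np.sum(joint[mask] * np.log(cond)))

def min_conditional_entropy_decode(codebook, y, nx, ny):
    """Index of the codeword with the smallest empirical H(X|Y) given y."""
    scores = [empirical_conditional_entropy(c, y, nx, ny) for c in codebook]
    return int(np.argmin(scores))

codebook = [[0, 0, 1, 1], [0, 1, 0, 1]]
y = [0, 0, 1, 1]
print(min_conditional_entropy_decode(codebook, y, nx=2, ny=2))
```

Note that the metric uses only the joint type of $(\mathbf{x}, \mathbf{y})$ and no channel parameters, which is what makes the decoder universal.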

7 This is a variant of the well-known maximum mutual information (MMI) decoder. In the case of constant
composition codes, these two decoders are identical.

The partition function corresponding to this universal decoder is the same as before, except
that $\hat{E}_{\mathbf{x}\mathbf{y}}\{d(X,Y)\}$ is replaced by $\hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)$. In this case,

$$
\begin{aligned}
Z_e(\beta|\mathbf{y}) &= \sum_{\{\hat{P}_{\mathbf{x}|\mathbf{y}}\}} N_{\mathbf{y}}(\hat{P}_{\mathbf{x}|\mathbf{y}}) \cdot e^{-n\beta \hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)} \\
&\doteq \exp\left\{ n\left( R - \ln|\mathcal{X}| + \max_{Q_{X|Y}:\ H_Q(X|Y) \ge \ln|\mathcal{X}| - R} \left[(1-\beta) H_Q(X|Y)\right] \right) \right\} \\
&\triangleq e^{-n\beta F_e(\beta, Y)}.
\end{aligned}
\tag{12}
$$

Now, it is easy to see how phase transitions behave (see Fig. 2): If $\beta < 1$, then the maximum is
attained at $H_Q(X|Y) = \ln|\mathcal{X}|$ and we get

$$Z_e(\beta) \doteq e^{n[R - \beta\ln|\mathcal{X}|]},$$

thus $F_e(\beta, Y) = F_p(\beta) = \ln|\mathcal{X}| - R/\beta$. If $\beta > 1$, we get

$$Z_e(\beta) \doteq e^{-n\beta(\ln|\mathcal{X}| - R)},$$

thus, $F_e(\beta, Y) = F_g = \ln|\mathcal{X}| - R$. Therefore, the boundary between the two phases is the horizontal
line $T_c = 1/\beta_c = 1$ (independently of $R$). This means that the glassy region here is larger than in ML
decoding for $R > C$. The boundary between the ferromagnetic and the glassy phases continues to be
$R = I(X;Y)$ as before. The ferromagnetic-paramagnetic boundary is now $H(X|Y) = \ln|\mathcal{X}| - R/\beta$,
or, equivalently, $T = 1/\beta = I(X;Y)/R$, which is below the ferromagnetic-paramagnetic boundary
of the MAP decoder. This can easily be shown by setting $R = \beta I(X;Y)$ (which is this boundary)
in the r.h.s. of the equation defining $T_0(R)$ and showing that the resulting expression is larger than
$\beta H(Y|X)$ (for $\beta < 1$), which is the l.h.s. of this equation (thus, we are still in the ferromagnetic
phase of MAP decoding): Specifically, the r.h.s. of the equation defining $T_0(R)$ is:

$$
\begin{aligned}
&\beta D_Y(\beta) + \ln|\mathcal{X}| - R - H_Y(\beta) \\
&\quad= \beta D_Y(\beta) + \ln|\mathcal{X}| - \beta I(X;Y) - H_Y(\beta) \\
&\quad= \beta H(Y|X) + \beta E_{Q_\beta}\left\{\ln\frac{1}{P(Y|X)}\right\} + \ln|\mathcal{X}| - \beta H(Y) - H_{Q_\beta}(X|Y) \\
&\quad= \beta H(Y|X) + \beta E_{Q_\beta}\left\{\ln\frac{1}{P(Y|X)}\right\} - \beta H(Y) + I_{Q_\beta}(X;Y) \\
&\quad\ge \beta H(Y|X) + \beta H_{Q_\beta}(Y|X) - \beta H(Y) + I_{Q_\beta}(X;Y) \\
&\quad\ge \beta H(Y|X) - \beta I_{Q_\beta}(X;Y) + I_{Q_\beta}(X;Y) \\
&\quad\ge \beta H(Y|X),
\end{aligned}
\tag{13}
$$

where the first equality is since $R = \beta I(X;Y)$ on the boundary, and the last inequality is since $\beta < 1$.
Thus, although this decoder achieves the optimum random coding error exponent, it has a phase
diagram which is worse than that of MAP decoding, as the ferromagnetic region is smaller and the
glassy region is larger.
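
The ordering of the two ferromagnetic-paramagnetic boundaries can be checked numerically in the BSC special case, where eq. (9) defines the MAP boundary $\beta_0(R)$ and the universal decoder's boundary is $\beta = R/I(X;Y)$ with $I(X;Y) = C = \ln 2 - h(p)$. This is only a sanity-check sketch under those assumptions; the function names and the bisection tolerance are illustrative, and the code assumes $0 < R < C$ and $0 < p < 1/2$.

```python
import math

def h(p):
    """Binary entropy in nats, for 0 < p < 1."""
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def beta0_map(R, p, tol=1e-12):
    """Root in (0, 1) of beta*h(p) = ln 2 - R - ln[p^beta + (1-p)^beta], i.e. eq. (9)."""
    def g(b):
        return math.log(2) - R - math.log(p ** b + (1.0 - p) ** b) - b * h(p)
    lo, hi = 0.0, 1.0            # g(0) = -R < 0 and g(1) = C - R > 0 for R < C
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def beta0_universal(R, p):
    """Ferro-paramagnetic boundary of the universal decoder: beta = R / I(X;Y)."""
    return R / (math.log(2) - h(p))

p = 0.1
for R in (0.05, 0.2, 0.35):
    # A larger beta means a lower boundary temperature T = 1/beta.
    print(R, 1.0 / beta0_map(R, p), 1.0 / beta0_universal(R, p))
```

In each printed row, the second temperature (universal decoding) comes out lower than the first (MAP decoding), consistent with the smaller ferromagnetic region described above.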

Figure 2: Phase diagram for universal decoding (horizontal axis: $R$, with the point $I(X;Y)$ marked; vertical axis: $T = 1/\beta$).

3 The Correct Decoding Exponent

We now proceed to establish relationships between the phase diagram of a random code, decoded
by a finite temperature MAP decoder, and the exponent of correct decoding at rates above capacity,
or to be more precise, rates above $I(X;Y)$, the mutual information induced by the uniform input
distribution and the channel.

Arimoto [33] begins the derivation of his bound on the probability of correct decoding by using
the inequality

$$P_c = \frac{1}{M} \sum_{\mathbf{y} \in \mathcal{Y}^n} \max_{1 \le i \le M} P(\mathbf{y}|\mathbf{x}_i) \le \frac{1}{M} \sum_{\mathbf{y} \in \mathcal{Y}^n} \left[ \sum_{i=1}^{M} P^\beta(\mathbf{y}|\mathbf{x}_i) \right]^{1/\beta}, \quad \beta > 0, \tag{14}$$

which becomes tight when $\beta \to \infty$. We will also use this inequality, but we shall proceed somewhat
differently than Arimoto. First, observe that for a randomly selected code, where the average
probability of correct decoding is upper bounded by

$$\bar{P}_c \le \frac{1}{M} \sum_{\mathbf{y} \in \mathcal{Y}^n} E\left\{ \left[ \sum_{i=1}^{M} P^\beta(\mathbf{y}|\mathbf{X}_i) \right]^{1/\beta} \right\}, \tag{15}$$

the expression in the square brackets is exactly $Z_e(\beta)$ (just with $M$ codewords instead of $M - 1$),
because the interpretation of this expression is that the codewords are drawn under $Q$ regardless
of $\mathbf{y}$. Since we are interested in $\beta \to \infty$ (in addition to the assumption that $R > I(X;Y)$), then we
are actually carrying out this calculation in the glassy regime.
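
The behavior of the bound (14) as a function of $\beta$ can be illustrated by brute force on a tiny code. The sketch below, which assumes a BSC and enumerates all $2^n$ output vectors, compares the exact probability of correct decoding under ML decoding with the right-hand side of (14); the function names and the toy codebook are made up for the example, and the enumeration is feasible only for very small $n$.

```python
import itertools
import numpy as np

def likelihoods(codebook, y, p):
    """P(y|x_i) for every codeword of a binary codebook over a BSC(p)."""
    codebook = np.asarray(codebook)
    n = codebook.shape[1]
    d = np.sum(codebook != np.asarray(y), axis=1)    # Hamming distances
    return p ** d * (1.0 - p) ** (n - d)

def exact_correct_decoding(codebook, p):
    """(1/M) * sum over y of max_i P(y|x_i): correct-decoding probability under ML."""
    M, n = np.asarray(codebook).shape
    return sum(likelihoods(codebook, y, p).max()
               for y in itertools.product([0, 1], repeat=n)) / M

def correct_decoding_bound(codebook, p, beta):
    """Right-hand side of (14): (1/M) * sum over y of [sum_i P(y|x_i)^beta]^(1/beta)."""
    M, n = np.asarray(codebook).shape
    return sum(np.sum(likelihoods(codebook, y, p) ** beta) ** (1.0 / beta)
               for y in itertools.product([0, 1], repeat=n)) / M

codebook = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
exact = exact_correct_decoding(codebook, p=0.1)
for beta in (1.0, 2.0, 10.0, 50.0):
    print(beta, correct_decoding_bound(codebook, 0.1, beta), exact)
```

At $\beta = 1$ the bound is trivial (it equals 1), and it decreases toward the exact value as $\beta$ grows, which is why the analysis is carried out for large $\beta$.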

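The above upper bound to $\bar{P}_c$ can be also written as:

$$\bar{P}_c \le \frac{1}{M} \sum_{\mathbf{y} \in \mathcal{Y}^n} E\left\{ \left[ \sum_{d \in \mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d} \right]^{1/\beta} \right\}, \tag{16}$$

where here $N_{\mathbf{y}}(d)$ denotes the number of codewords $\mathbf{x}_i$ for which $-\ln P(\mathbf{y}|\mathbf{x}_i) = d$, and $\mathcal{D}_n$ is the
set of values that the function $d(\mathbf{x},\mathbf{y}) = -\ln P(\mathbf{y}|\mathbf{x})$ can take on for a given $\mathbf{y}$, as $\mathbf{x}$ exhausts the
codebook $\mathcal{C}$. Note that as $d(\mathbf{x},\mathbf{y})$ depends only on the empirical joint distribution of $\mathbf{x}$ and $\mathbf{y}$, then
$|\mathcal{D}_n|$ cannot exceed the number of empirical conditional distributions (or conditional type classes)
corresponding to pairs of $n$-sequences, and so, $|\mathcal{D}_n|$ is upper bounded by a polynomial in $n$.

Now, when a random code is considered, then instead of applying Jensen's inequality for $\beta > 1$
(as was done in [33]), and thereby inserting the expectation operator into the square brackets, let us
adopt another approach. Consider the following events:

$$\mathcal{B} = \left\{ \mathcal{C}:\ N_{\mathbf{y}}(d) \ge \exp\left\{ n\left( [R - \ln|\mathcal{X}| + h_0(d/n|\mathbf{y})]_+ + \epsilon \right) \right\} \text{ for some } d \in \mathcal{D}_n \right\},$$

where $[t]_+ \triangleq \max\{0, t\}$ and where $h_0(\delta|\mathbf{y})$ is defined as the maximum of $H_Q(X|Y)$ subject to the
constraints that $E_Q\{d(X,Y)\} = \delta$ and that $Y$ is distributed according to $\hat{P}_{\mathbf{y}}$. Also, define

$$\mathcal{W}_i = \left\{ \mathcal{C}:\ \min\{d:\ N_{\mathbf{y}}(d) \ge 1\} = i \right\}, \quad i < d_0(\mathbf{y}) = n\delta_Y(R),$$

where we recall that $\delta_Y(R)$ is the solution to the equation $h_0(\delta|\mathbf{y}) = \ln|\mathcal{X}| - R$. Note that $\{\mathcal{W}_i\}$
are disjoint events. Now, for $\beta > \beta_c(R)$: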

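$$
\begin{aligned}
E\left\{ \left[ \sum_{d \in \mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d} \right]^{1/\beta} \right\}
&\le \Pr\{\mathcal{B}\} \cdot \left[ e^{nR} \cdot e^{-\beta \cdot 0} \right]^{1/\beta}
+ \sum_{d < d_0(\mathbf{y})} \Pr\{\mathcal{W}_d \cap \mathcal{B}^c\} \cdot \left[ e^{n\epsilon} e^{-\beta d} \right]^{1/\beta} \\
&\quad + \Pr\left\{ \bigcap_{d < d_0(\mathbf{y})} \mathcal{W}_d^c \cap \mathcal{B}^c \right\} \cdot e^{-nF_g(Y)} \cdot e^{n\epsilon/\beta}.
\end{aligned}
\tag{17}
$$
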

This inequality calls for some explanation. We divide the set of configurations of the RV's
$\{N_{\mathbf{y}}(d)\}_{d\in\mathcal{D}_n}$ into three classes, defined by the events $\mathcal{B}$ and $\{\mathcal{W}_i\}$. In the first class, corresponding
to the first term on the right-hand side, $\{N_{\mathbf{y}}(d)\}$ fall in $\mathcal{B}$, where there is at least one value
of $d$ for which $N_{\mathbf{y}}(d)$ is exponentially larger than its expectation (by at least $\epsilon$ in the exponent). We bound the
value of $[\sum_{d\in\mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d}]^{1/\beta}$ in this class very "generously", by the maximum value
it can possibly take, that is, when all $e^{nR}$ codewords are at zero distance from $\mathbf{y}$. This quantity
is weighted, however, by $\Pr\{\mathcal{B}\}$, which, as is shown in the Appendix (Subsection A.2), decays
double-exponentially rapidly, at least as fast as $e^{-e^{n\epsilon}}$, and so this first term is negligible. The other two
classes correspond to $\mathcal{B}^c$, where for all $d \in \mathcal{D}_n$, $N_{\mathbf{y}}(d)$ does not exceed its expectation times $e^{n\epsilon}$.
Here we distinguish between two cases (corresponding to the two other classes). In one of them,
(at least) one of the distances below the generalized GV distance $d_0(\mathbf{y}) = n\delta_{\mathbf{y}}(R)$ is populated
by subexponentially$^8$ many codewords. Since we are operating in the glassy regime, the dominant
contribution to $[\sum_{d\in\mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d}]^{1/\beta}$ will be due to these minimum distance codewords, and the
weighting of the event of minimum distance $d$ is, of course, according to $\Pr\{\mathcal{W}_d \cap \mathcal{B}^c\}$. In the other
case, which captures most of the probability mass (since it is the typical configuration of $\{N_{\mathbf{y}}(d)\}$),
none of the distances below the generalized GV distance is populated by codewords, whereas for
larger distances, $\{N_{\mathbf{y}}(d)\}$ are all (within a factor of $e^{n\epsilon}$) about their expectations. In this case,
our expression again behaves according to the glassy regime, where the generalized GV distance
dominates the partition function.
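The typical configuration described above can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not part of the derivation: a binary alphabet, codewords drawn uniformly from $\{0,1\}^n$, and the Hamming distance $k$ used in place of $d(\mathbf{x},\mathbf{y}) = -\ln P(\mathbf{y}|\mathbf{x})$ (for the BSC the two are affinely related). The function names are hypothetical. The sketch computes $\ln E\{N(k)\}$ and shows that it is negative at every distance below the GV distance, so that these distances are typically empty.

```python
import math


def binary_entropy(x):
    """Binary entropy h(x) in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def gv_distance(rate, tol=1e-12):
    """Solve h(delta) = ln 2 - rate for delta in [0, 1/2] by bisection."""
    target = math.log(2.0) - rate
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_expected_count(n, rate, k):
    """ln E{N(k)} for e^{n*rate} codewords drawn uniformly from {0,1}^n."""
    return n * rate + math.log(math.comb(n, k)) - n * math.log(2.0)


# Illustrative parameters (assumptions): block length 100, rate 0.3 nats.
n, rate = 100, 0.3
delta = gv_distance(rate)
for k in range(0, n // 2 + 1, 5):
    side = "below" if k < n * delta else "above"
    value = log_expected_count(n, rate, k)
    print(f"k={k:3d} ({side} GV distance)  ln E[N(k)] = {value:8.2f}")
```

Because $\binom{n}{k} \le e^{nh(k/n)}$, the expected count below the GV distance is at most $e^{n[R - \ln 2 + h(k/n)]} < 1$, which is the single-letter version of the exponent $R - \ln|\mathcal{X}| + h_0(\delta|\mathbf{y})$ used in the text.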

8 The event $\mathcal{B}^c$ guarantees that there are only subexponentially many codewords at distances below $d_0(\mathbf{y})$.

Now, regarding the second term, for $\delta = d/n < \delta_{\mathbf{y}}(R)$,

$$\Pr\{\mathcal{W}_d \cap \mathcal{B}^c\} \le \Pr\{\mathcal{W}_d\} \le \Pr\{N_{\mathbf{y}}(d) \ge 1\}, \tag{18}$$

where the latter expression is shown (Appendix, Subsection A.2) to decay at the exponential rate
of $e^{-n[\ln|\mathcal{X}| - R - h_0(\delta|\mathbf{y})]}$. Thus,



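$$
\begin{aligned}
E\left\{\left[\sum_{d\in\mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d}\right]^{1/\beta}\right\}
&\stackrel{\cdot}{\le} \sum_{\delta<\delta_{\mathbf{y}}(R)} e^{-n[\ln|\mathcal{X}|-R-h_0(\delta|\mathbf{y})]}\cdot\left[e^{n\epsilon} e^{-\beta n\delta}\right]^{1/\beta} + e^{-nF_g(Y)}\cdot e^{n\epsilon/\beta} \\
&\stackrel{\cdot}{=} e^{n(R-\ln|\mathcal{X}|)}\cdot e^{n\epsilon/\beta}\exp\left\{n\max_{\delta<\delta_{\mathbf{y}}(R)}[h_0(\delta|\mathbf{y})-\delta]\right\} + e^{-nF_g(Y)}\cdot e^{n\epsilon/\beta} \\
&= e^{n(R-\ln|\mathcal{X}|)}\cdot e^{n\epsilon/\beta}\exp\{n[h_0(\delta_{\mathbf{y}}(R)|\mathbf{y})-\delta_{\mathbf{y}}(R)]\} + e^{-nF_g(Y)}\cdot e^{n\epsilon/\beta} \\
&= e^{n(R-\ln|\mathcal{X}|)}\exp\{n[\ln|\mathcal{X}|-R-\delta_{\mathbf{y}}(R)]\}\cdot e^{n\epsilon/\beta} + e^{-nF_g(Y)}\cdot e^{n\epsilon/\beta} \\
&\stackrel{\cdot}{=} e^{-nF_g(Y)}\cdot e^{n\epsilon/\beta}.
\end{aligned}
\tag{19}
$$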



Since $\epsilon$ can be chosen arbitrarily small for large $n$ (in fact, one may let $\epsilon$ vanish with $n$ sufficiently
slowly), the exponential rate of the expression under discussion is actually bounded by $e^{-nF_g(Y)}$.
Note that whenever $\beta > \beta_c$, this expression no longer depends on $\beta$. Finally, substituting this
bound back into the bound on $P_c$, we get:



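$$
\begin{aligned}
P_c &\stackrel{\cdot}{\le} e^{-nR}\sum_{\mathbf{y}\in\mathcal{Y}^n} e^{-nF_g(Y)} \\
&\stackrel{\cdot}{=} e^{-nR}\cdot e^{n\max_Y[H(Y)-F_g(Y)]} \\
&= e^{-n(R-\max_Y[H(Y)-F_g(Y)])}.
\end{aligned}
\tag{20}
$$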



This calculation can be shown to be exponentially tight: a lower bound can be obtained by confining
the calculation to the (high probability) event $\mathcal{W}_0^c \cap \mathcal{W}_1^c \cap \cdots \cap \mathcal{W}_{d_0(\mathbf{y})}^c \cap \mathcal{B}^c$ with the additional
restriction that $N_{\mathbf{y}}(d) \ge E\{N_{\mathbf{y}}(d)\}\cdot e^{-n\epsilon}$ for all $d \ge d_0(\mathbf{y})$ (i.e., the last term only in the above
derivation). Note that in Arimoto's paper, where Jensen's inequality is used, the expectation of
$N_{\mathbf{y}}(d)\, e^{-\beta d}$ is computed, and this actually corresponds to the paramagnetic regime (without
the constraint $H_Q(X|Y) \ge \ln|\mathcal{X}| - R$). The resulting bound might not be exponentially tight in
general.$^9$ Finally, the optimization $\max_Y[H(Y) - F_g(Y)]$ can be carried out explicitly, yielding
$\ln\sum_y e^{-f_g(y)}$, where $f_g(y) = E_{Q_{\beta R}}\{d(X,Y)|Y = y\}$.



9 Note that although the exact reliability function (for optimum codes) for rates above capacity was established
by Dueck and Körner [34], here we are only focusing on random codes drawn under an i.i.d. distribution.

We have obtained then a random coding exponent formula in terms of the free energy density
in the glassy phase, from which we learn that the free energy density of the glassy phase plays
a central role in the calculation of the exponent of correct decoding. To obtain some insight, it is
instructive to examine this expression in the special case of the BSC. Here, since $F_g = F_g(Y)$ does
not depend on the probability distribution of $Y$, we get:

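$$
\begin{aligned}
P_c &\le e^{n[\ln 2 - R - F_g]} \\
&= e^{n\left[h(\delta_{GV}(R)) - \delta_{GV}(R)\ln\frac{1}{p} - (1-\delta_{GV}(R))\ln\frac{1}{1-p}\right]} \\
&= e^{-nD(\delta_{GV}(R)\|p)},
\end{aligned}
\tag{21}
$$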

where for $a, b \in (0,1)$, $D(a\|b) = a\ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b}$. This result has the intuitively appealing
interpretation of the probability of the large deviations event that the channel makes $n\delta_{GV}(R)$ errors
or less (although $p > \delta_{GV}(R)$), in which case the correct codeword 'penetrates' into the sphere of
radius $n\delta_{GV}(R)$, whose surface is populated by the codewords that dominate the glassy phase. Of
course, when such an event happens, the correct codeword dominates the partition function, and
thus the decoding is correct.
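As a numerical sanity check of (21), the sketch below evaluates the correct-decoding exponent of the BSC in two ways: as $R + F_g - \ln 2$ and as the divergence $D(\delta_{GV}(R)\|p)$. It assumes, as read off from the middle line of (21), that $F_g = \delta_{GV}(R)\ln\frac{1}{p} + (1-\delta_{GV}(R))\ln\frac{1}{1-p}$. The function names and the parameter values ($p = 0.1$, $R = 0.5$ nats, which is above capacity) are illustrative choices, not taken from the text.

```python
import math


def binary_entropy(x):
    """Binary entropy h(x) in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def gv_distance(rate, tol=1e-12):
    """Solve h(delta) = ln 2 - rate for delta in [0, 1/2] by bisection."""
    target = math.log(2.0) - rate
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def divergence(a, b):
    """Binary divergence D(a||b) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def correct_decoding_exponent(rate, p):
    """Return (delta_GV, R + F_g - ln 2, D(delta_GV || p)) for the BSC."""
    d = gv_distance(rate)
    # Assumed glassy free energy: energy per symbol at the GV distance.
    f_g = d * math.log(1.0 / p) + (1.0 - d) * math.log(1.0 / (1.0 - p))
    return d, rate + f_g - math.log(2.0), divergence(d, p)


d, via_free_energy, via_divergence = correct_decoding_exponent(0.5, 0.1)
print(f"delta_GV = {d:.6f}")
print(f"R + F_g - ln 2     = {via_free_energy:.9f}")
print(f"D(delta_GV || p)   = {via_divergence:.9f}")
```

The two values agree up to the bisection tolerance because $h(\delta_{GV}(R)) = \ln 2 - R$ by definition of the GV distance.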

4 The Random Coding Error Exponent 

Let us now examine rates below I(X; Y). Consider Gallager's upper bound on the error probability 
for a given code [35] : 

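$$P_e \le \frac{1}{M}\sum_{m=1}^{M}\sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}_m)^{1/(1+\rho)}\cdot\left[\sum_{m'\ne m} P(\mathbf{y}|\mathbf{x}_{m'})^{1/(1+\rho)}\right]^{\rho}, \qquad \rho \ge 0. \tag{22}$$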



The bracketed term is once again identified with $Z_e(\beta)$ for $\beta = 1/(1+\rho) \le 1$, in contrast to the
calculation of $P_c$, where we used large values of $\beta$. For each $m$, let us first take only the expectation w.r.t.
the incorrect codewords, referring to the random variables $\{N_{\mathbf{y}}(d)\}$. Let this partial expectation
be denoted by $\hat{P}_e$. Throughout, $\beta$ denotes $1/(1+\rho)$. One way to carry out this calculation is to
use the same technique as we used in the previous section, by classifying the distance spectrum
$\{N_{\mathbf{y}}(d)\}$ to its various classes. However, since we already know that the use of Jensen's
inequality would not harm the exponential tightness [36], it will be simpler to apply Jensen's
inequality (for $0 \le \rho \le 1$, that is, $0.5 \le \beta \le 1$) and thereby essentially carry out the calculation in
the paramagnetic regime. We proceed then as follows:

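$$
\begin{aligned}
\hat{P}_e &\le \frac{1}{M}\sum_{m=1}^{M}\sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}_m)^{\beta}\cdot E\left\{\left[\sum_{d\in\mathcal{D}_n} N_{\mathbf{y}}(d)\, e^{-\beta d}\right]^{\rho}\right\} \\
&\le \frac{1}{M}\sum_{m=1}^{M}\sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}_m)^{\beta}\cdot\left[\sum_{d\in\mathcal{D}_n} E\{N_{\mathbf{y}}(d)\}\cdot e^{-\beta d}\right]^{\rho} \\
&\stackrel{\cdot}{=} \frac{1}{M}\sum_{m=1}^{M}\sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}_m)^{\beta}\cdot\left[\sum_{d\in\mathcal{D}_n} e^{n[R-\ln|\mathcal{X}|+h_0(\delta|\mathbf{y})]}\cdot e^{-\beta d}\right]^{\rho} \\
&\stackrel{\cdot}{=} \frac{1}{M}\sum_{m=1}^{M}\sum_{\mathbf{y}\in\mathcal{Y}^n} P(\mathbf{y}|\mathbf{x}_m)^{\beta}\cdot e^{-n\rho\beta F_p(\beta,Y)}.
\end{aligned}
\tag{23}
$$

Next, we take the expectation w.r.t. the correct codeword $\mathbf{x}_m$. Define

$$\Gamma(y) = \ln\sum_{x\in\mathcal{X}} P^{\beta}(y|x) - \ln|\mathcal{X}|, \qquad y\in\mathcal{Y}.$$

Then, the average error probability $\bar{P}_e$ is upper bounded by

$$
\begin{aligned}
\bar{P}_e &\stackrel{\cdot}{\le} \sum_{\mathbf{y}\in\mathcal{Y}^n} e^{\sum_{i=1}^{n}\Gamma(y_i)}\cdot e^{-n\rho\beta F_p(\beta,Y)} \\
&= \sum_{\mathbf{y}\in\mathcal{Y}^n}\exp\left\{n\left[\sum_{y\in\mathcal{Y}} P_{\mathbf{y}}(y)\Gamma(y) - \rho\beta F_p(\beta,Y)\right]\right\} \\
&\stackrel{\cdot}{=} \exp\left\{n\cdot\max_Y\left[H(Y) + \sum_{y\in\mathcal{Y}} P_Y(y)\Gamma(y) - \rho\beta F_p(\beta,Y)\right]\right\} \\
&= \exp\left\{-n\cdot\min_Y\left[\rho\beta F_p(\beta,Y) - \sum_{y\in\mathcal{Y}} P_Y(y)\Gamma(y) - H(Y)\right]\right\}.
\end{aligned}
\tag{24}
$$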



Note that $\Gamma(y)$ is also related to a free energy expression, corresponding to the uniform prior over
the entire input space $\mathcal{X}^n$, not only the codebook. Thus, we have two free energy expressions, one
pertaining to the contribution of the correct codeword, and the other related to the contributions
of the incorrect codewords.



In the special case of the BSC, where $F_p(\beta,Y)$ does not depend on $Y$ and $\Gamma = \Gamma(y)$ does
not depend on $y$, we get the exponential rate of

$$
\begin{aligned}
\min_Y[\beta\rho F_p(\beta) - \Gamma - H(Y)]
&= \beta\rho F_p(\beta) - (\ln[p^{\beta} + (1-p)^{\beta}] - \ln 2) - \ln 2 \\
&= \rho(\ln 2 - R) - (1+\rho)\ln[p^{\beta} + (1-p)^{\beta}] \\
&= \rho\ln 2 - (1+\rho)\ln[p^{1/(1+\rho)} + (1-p)^{1/(1+\rho)}] - \rho R \\
&= E_0(\rho) - \rho R \\
&= E(\rho,R),
\end{aligned}
\tag{25}
$$

which is, as expected, Gallager's reliability function for the BSC. The optimum choice of $\rho$ depends
on $R$. As is shown in [35, pp. 151-152], in the range $R < \ln 2 - h(p_{1/2})$, that is, $p_{1/2} < \delta_{GV}(R)$, we
have $\rho = 1$, which means $\beta = \frac{1}{2}$. For $R \in [\ln 2 - h(p_{1/2}), \ln 2 - h(p)]$, the optimum $\rho$ is in $[0,1)$,
and it satisfies $R = \ln 2 - h(p_{1/(1+\rho)}) = \ln 2 - h(p_{\beta})$, or, equivalently, $p_{\beta} = \delta_{GV}(R)$, which means
that we move along the boundary between the glassy phase and the paramagnetic phase of
$Z_e(\beta|\mathbf{y})$.
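The behavior of the optimum $\rho$ described above can be checked numerically. The sketch below maximizes $E_0(\rho) - \rho R$ from (25) over a grid of $\rho \in [0,1]$ for a BSC with $p = 0.1$ (an illustrative value). It assumes that $p_\beta$ denotes the tilted crossover probability $p^{\beta}/(p^{\beta} + (1-p)^{\beta})$; this definition is not restated in the present section, so it should be read as an assumption, and all function names are hypothetical.

```python
import math


def binary_entropy(x):
    """Binary entropy h(x) in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def gallager_e0(rho, p):
    """E_0(rho) of the BSC with uniform inputs, as in the fourth line of (25)."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2.0) - (1.0 + rho) * math.log(p ** s + (1.0 - p) ** s)


def tilted(p, beta):
    """Assumed definition: p_beta = p^beta / (p^beta + (1 - p)^beta)."""
    return p ** beta / (p ** beta + (1.0 - p) ** beta)


def best_rho(rate, p, grid=10001):
    """Grid search for the rho in [0, 1] that maximizes E_0(rho) - rho * rate."""
    def objective(k):
        rho = k / (grid - 1)
        return gallager_e0(rho, p) - rho * rate
    return max(range(grid), key=objective) / (grid - 1)


p = 0.1
critical_rate = math.log(2.0) - binary_entropy(tilted(p, 0.5))
capacity = math.log(2.0) - binary_entropy(p)
print(f"critical rate = {critical_rate:.4f}, capacity = {capacity:.4f}")
for rate in (0.05, 0.25):
    rho = best_rho(rate, p)
    p_beta = tilted(p, 1.0 / (1.0 + rho))
    gap = binary_entropy(p_beta) - (math.log(2.0) - rate)
    print(f"R = {rate:.2f}: rho* = {rho:.4f}, h(p_beta) - (ln 2 - R) = {gap:+.5f}")
```

Below the critical rate the search returns $\rho = 1$; between the critical rate and capacity it returns an interior $\rho$ for which $h(p_\beta)$ matches $\ln 2 - R$ up to the grid resolution, in line with the boundary condition $p_\beta = \delta_{GV}(R)$.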
Appendix

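A.1 Proof of the Concavity of $J_{\mathbf{y}}(\beta, \cdot)$

Let $Q_1$ and $Q_2$ achieve $J_{\mathbf{y}}(\beta,R_1)$ and $J_{\mathbf{y}}(\beta,R_2)$, respectively. Now, let $Q = \alpha Q_1 + (1-\alpha)Q_2$ for
some $\alpha \in (0,1)$. First, observe that by the concavity of the conditional entropy in $Q_{X|Y}$ for fixed
$Q_Y$, we have

$$H_Q(X|Y) \ge \alpha H_{Q_1}(X|Y) + (1-\alpha)H_{Q_2}(X|Y) \ge \ln|\mathcal{X}| - \alpha R_1 - (1-\alpha)R_2.$$

It follows then that $H_Q(X|Y) - \beta E_Q d(X,Y) \le J_{\mathbf{y}}(\beta, \alpha R_1 + (1-\alpha)R_2)$. But, on the other hand,

$$
\begin{aligned}
H_Q(X|Y) - \beta E_Q d(X,Y) &\ge \alpha[H_{Q_1}(X|Y) - \beta E_{Q_1} d(X,Y)] + (1-\alpha)[H_{Q_2}(X|Y) - \beta E_{Q_2} d(X,Y)] \\
&= \alpha J_{\mathbf{y}}(\beta,R_1) + (1-\alpha)J_{\mathbf{y}}(\beta,R_2).
\end{aligned}
\tag{26}
$$

Thus,

$$J_{\mathbf{y}}(\beta, \alpha R_1 + (1-\alpha)R_2) \ge \alpha J_{\mathbf{y}}(\beta,R_1) + (1-\alpha)J_{\mathbf{y}}(\beta,R_2).$$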
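A.2 Large Deviations Behavior of $N_{\mathbf{y}}(d)$

For $a, b \in [0,1]$, consider the binary divergence

$$
\begin{aligned}
D(a\|b) &= a\ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b} \\
&= a\ln\frac{a}{b} + (1-a)\ln\left[1 + \frac{b-a}{1-b}\right].
\end{aligned}
\tag{27}
$$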



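To derive a lower bound to $D(a\|b)$, let us use the inequality

$$\ln(1+x) = -\ln\frac{1}{1+x} = -\ln\left(1 - \frac{x}{1+x}\right) \ge \frac{x}{1+x},$$

and then

$$
\begin{aligned}
D(a\|b) &\ge a\ln\frac{a}{b} + (1-a)\cdot\frac{(b-a)/(1-b)}{1 + (b-a)/(1-b)} \\
&= a\ln\frac{a}{b} + b - a \\
&> a\left(\ln\frac{a}{b} - 1\right).
\end{aligned}
\tag{28}
$$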



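Now, let $N_{\mathbf{y}}(d)$ denote the number of codewords for which $-\ln P(\mathbf{y}|\mathbf{x}_i) = d$. As mentioned earlier,
$N_{\mathbf{y}}(d)$ is the sum of the $e^{nR}$ independent binary random variables $1\{d(\mathbf{X}_i,\mathbf{y}) = d\}$, where the
probability that $d(\mathbf{X}_i,\mathbf{y}) = d$ is exponentially $b = e^{-n[\ln|\mathcal{X}| - h_0(\delta|\mathbf{y})]}$, $h_0(\delta|\mathbf{y})$ being the maximum
of $H_Q(X|Y)$ subject to the constraints that $E_Q\{d(X,Y)\} = \delta$, $\delta = d/n$, and that $Y$ is distributed
according to $P_{\mathbf{y}}$. The event $N_{\mathbf{y}}(d) \ge e^{nA}$, for $d = \delta n$ and $A \in [0,R)$, means that the relative
frequency of the event $1\{d(\mathbf{X}_i,\mathbf{y}) = d\}$ is at least $a = e^{-n(R-A)}$. Thus, by the Chernoff bound:

$$
\begin{aligned}
\Pr\{N_{\mathbf{y}}(d) \ge e^{nA}\} &\le \exp\left\{-e^{nR} D\left(e^{-n(R-A)}\,\big\|\,e^{-n[\ln|\mathcal{X}| - h_0(\delta|\mathbf{y})]}\right)\right\} \\
&\le \exp\left\{-e^{nR}\cdot e^{-n(R-A)}\left(n[\ln|\mathcal{X}| - R - h_0(\delta|\mathbf{y}) + A] - 1\right)\right\} \\
&\le \exp\left\{-e^{nA}\left(n[\ln|\mathcal{X}| - R - h_0(\delta|\mathbf{y}) + A] - 1\right)\right\}.
\end{aligned}
\tag{29}
$$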

Now, for $A = [R - \ln|\mathcal{X}| + h_0(\delta|\mathbf{y})]_+ + \epsilon$, the term in the square brackets is at least $\epsilon > 0$,
and thus $\Pr\{N_{\mathbf{y}}(d) \ge e^{nA}\}$ decays double-exponentially rapidly, not slower than $e^{-e^{n\epsilon}}$. The
probability of the union of the (polynomially many) events $\{N_{\mathbf{y}}(d) \ge e^{nA}\}_{d\in\mathcal{D}_n}$, which is upper
bounded by the sum of the probabilities, is still double-exponentially small. Thus, $\Pr\{\mathcal{B}\}$ decays
double-exponentially rapidly. Now, the event $\{N_{\mathbf{y}}(d) \ge 1\}$ corresponds to the choice $A = 0$.
For $\delta < \delta_{\mathbf{y}}(R)$, $\delta_{\mathbf{y}}(R)$ being the solution to the equation $\ln|\mathcal{X}| - R = h_0(\delta|\mathbf{y})$, which means that
$\ln|\mathcal{X}| - R - h_0(\delta|\mathbf{y}) > 0$, this gives an ordinary exponential decay at the rate of $e^{-n[\ln|\mathcal{X}| - R - h_0(\delta|\mathbf{y})]}$.
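The chain of bounds behind (28) and (29) can be checked on a small example. The sketch below compares the exact upper tail of a binomial random variable (standing in for $N_{\mathbf{y}}(d)$, with $M$ trials in place of $e^{nR}$ codewords) with the Chernoff bound $e^{-MD(a\|b)}$ and with the relaxed bound $e^{-Ma(\ln\frac{a}{b} - 1)}$ implied by (28). The parameter values and function names are illustrative assumptions, chosen small enough for the exact tail to be computable.

```python
import math


def divergence(a, b):
    """Binary divergence D(a||b) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def binomial_upper_tail(num_trials, b, k):
    """Exact Pr{N >= k} for N ~ Binomial(num_trials, b)."""
    return sum(
        math.comb(num_trials, j) * b ** j * (1.0 - b) ** (num_trials - j)
        for j in range(k, num_trials + 1)
    )


def chernoff_bounds(num_trials, b, k):
    """Return (exp{-M D(a||b)}, exp{-M a (ln(a/b) - 1)}) with a = k / M > b."""
    a = k / num_trials
    chernoff = math.exp(-num_trials * divergence(a, b))
    relaxed = math.exp(-num_trials * a * (math.log(a / b) - 1.0))
    return chernoff, relaxed


# Illustrative values (assumptions): 200 trials, success probability 0.01,
# and the event that at least 10 successes occur (relative frequency 0.05).
M, b, k = 200, 0.01, 10
tail = binomial_upper_tail(M, b, k)
chernoff, relaxed = chernoff_bounds(M, b, k)
print(f"exact tail      = {tail:.3e}")
print(f"Chernoff bound  = {chernoff:.3e}")
print(f"relaxed bound   = {relaxed:.3e}")
```

The exact tail is below the Chernoff bound, which in turn is below the relaxed bound; the relaxed form is the one that yields the double-exponential decay once $Ma = e^{nA}$ grows exponentially with $n$.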